Memory augmented recurrent neural networks for de-novo drug design

A recurrent neural network (RNN) is a machine learning model that learns the relationships between elements of an input series, in addition to inferring a relationship between the data input to the model and the target output. Memory augmentation allows the RNN to learn the interrelationships between elements of the input over a protracted length of the input series. Inspired by the success of the stack-augmented RNN (StackRNN) in generating strings for various applications, we present two memory-augmented RNN-based architectures, the Neural Turing Machine (NTM) and the Differentiable Neural Computer (DNC), for the de-novo generation of small molecules. We trained a character-level convolutional neural network (CNN) to predict the properties of a generated string and to compute a reward or loss in a deep reinforcement learning setup that biases the generator to produce molecules with the desired property. Further, we compare the performance of these architectures to gain insight into their relative merits in terms of the validity and novelty of the generated molecules and the degree of property bias achievable in the computational generation of de-novo drugs. We also compare these architectures with simpler recurrent neural networks (Vanilla RNN, LSTM, and GRU) without an external memory component to explore the impact of augmented memory on the task of de-novo generation of small molecules.


Dataset profile
The dataset for the generator was sampled such that molecular weight did not exceed 900 daltons. This cutoff ensured that the generator was trained only on small molecules. The number of strings for the generator was 788,452. We neither canonicalized the strings in the dataset nor removed duplicates. Among the 788,452 strings, 212,196 molecules were in canonical form, and the set included 29,656 duplicates.
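As an illustration, this dataset profile can be reproduced with RDKit along the following lines (a minimal sketch; the function name and exact order of checks are ours, not the original pipeline's):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def profile_dataset(smiles_list, max_weight=900.0):
    """Count strings below the 900 Da cutoff, strings already in canonical
    form, and duplicate strings. Illustrative sketch, not the original code."""
    kept, canonical, duplicates, seen = [], 0, 0, set()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None or Descriptors.MolWt(mol) > max_weight:
            continue  # drop unparsable strings and molecules heavier than 900 daltons
        kept.append(smi)
        if Chem.MolToSmiles(mol) == smi:
            canonical += 1  # string was already in RDKit canonical form
        if smi in seen:
            duplicates += 1  # exact string duplicate
        seen.add(smi)
    return kept, canonical, duplicates
```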
For the predictor, the input may not always be a small molecule, e.g. when the generator produces molecules heavier than 900 daltons. Hence, to cover all the possibilities, the predictor sample was 200,000 molecules drawn without any filtering.
As shown in Section 5.2.2, the properties of the 200,000-molecule sample are similar to those of the complete dataset. Only the range varies, which is expected since the range is sensitive to outliers.

Bar plot of atomic types in the dataset
Number of strings in the dataset = 788,452. The atom types were obtained from RDKit.
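The atom-type counts underlying the bar plot can be obtained with RDKit roughly as follows (a sketch under the assumption that element symbols were counted per atom):

```python
from collections import Counter
from rdkit import Chem

def atom_type_counts(smiles_list):
    """Tally element symbols over all parsable strings in the dataset."""
    counts = Counter()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            counts.update(atom.GetSymbol() for atom in mol.GetAtoms())
    return counts
```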

Salts in dataset
In the dataset of 788,452 molecules, 15 molecules contained salts. These molecules were not de-salted. Among the molecules generated by the models, only the DNC model biased towards minimizing the value of logP produced a salt-containing molecule (1 out of the 4,000 molecules it generated). Molecules generated by all other models did not contain any salts.
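The text does not state how salts were identified; a simple heuristic consistent with the counts above is to flag SMILES containing more than one disconnected fragment, sketched below:

```python
from rdkit import Chem

def contains_salt(smi):
    """Treat a SMILES with more than one disconnected fragment (a '.' separator)
    as a salt/mixture. Heuristic sketch, not necessarily the original criterion."""
    mol = Chem.MolFromSmiles(smi)
    return mol is not None and len(Chem.GetMolFrags(mol)) > 1
```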

NTM
(For results mentioned in Tables 3 and 4 of the main paper)
The memory size was chosen to be a power of 2, in line with computer architecture. The embedding length is the same as the memory size so that there is no wastage or overuse of memory. The number of units is also the same as the memory size, so that each unit can handle one cell. The number of memory locations is greater than the length of the SMILES string, so that there is enough memory for the whole SMILES. The number of controller layers, read heads, and write heads is 1, as any more than that led to worse results.
We generated strings at the 11,000th and 13,000th iterations when training a model with 2 read heads, write heads, and controller layers; the percentages of valid strings were 77% and 55% respectively, a steep decline. Considering that there were models which generated more than 90% valid strings, these results did not seem worth pursuing.
This also keeps it in line with the stack RNN, where these are not configurable parameters.
The conv_shift_range, clip_value, learning_rate, and max_grad_norm are taken from https://arxiv.org/abs/1807.08518, the NTM paper on which our implementation is based. The init_mode is constant, which gave the best results in that paper. The batch size chosen was the maximum value that would fit in our GPU configuration given the other memory requirements (number of memory units, embedding length, etc.).
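The parameter choices described above can be summarized as an illustrative configuration; the numeric values below are placeholders (the actual values are in Tables 3 and 4 of the main paper), and the key names depend on the NTM implementation used:

```python
MAX_SMILES_LENGTH = 100  # placeholder; any value exceeding the longest training SMILES

ntm_config = {
    "memory_size": 128,                             # a power of 2 (placeholder value)
    "embedding_length": 128,                        # equal to the memory size
    "num_units": 128,                               # one unit per memory cell
    "num_memory_locations": MAX_SMILES_LENGTH + 1,  # more locations than the longest SMILES
    "controller_layers": 1,                         # more than 1 degraded validity (see above)
    "read_heads": 1,
    "write_heads": 1,
    "init_mode": "constant",                        # best-performing initialization in the reference paper
    # conv_shift_range, clip_value, learning_rate, max_grad_norm: taken unchanged
    # from the reference NTM implementation (arXiv:1807.08518).
    "batch_size": 64,                               # placeholder; the largest value fitting GPU memory
}
```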

DNC
(For results mentioned in Tables 3 and 4 of the main paper) The DNC is more similar to the NTM than to the stack RNN, hence the parameters are essentially the same.

Table 12 - Jaccard index * 10^5 for strings generated by each model at a particular iteration

The Jaccard index for all of the models more or less increased with the number of iterations. This is expected: the more training data the model sees, the more similar its output becomes to the training set.
There is some variation, e.g. for the StackRNN it decreases from 30,000 to 35,000 iterations, and for the DNC it decreases from 25,000 to 35,000, but this can be attributed to variation in training.
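For reference, the Jaccard index in Table 12 and the common string percentage used later can be computed from the sets of generated and training strings as follows (a sketch based on the standard definitions; the exact preprocessing may differ):

```python
def overlap_metrics(generated, training):
    """Jaccard index (reported * 10^5 in Table 12) and common-string percentage
    between the generated strings and the training set."""
    gen, train = set(generated), set(training)
    common = gen & train
    jaccard = len(common) / len(gen | train)
    common_pct = 100.0 * len(common) / len(gen)
    return jaccard * 1e5, common_pct
```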
At the end of 45,000 iterations, the NTM had the highest Jaccard index and the stack RNN the lowest. This means that the NTM produced more molecules that were present in the training set. However, it should be noted that no explicit optimization was done to reduce the number of common strings between the training data and what the model generates.

Comparison of biased StackRNN models
As seen in the above data, when considering the number of benzene rings as the property to be biased, the StackRNN distribution shifts from the unbiased distribution towards the maximum in the maximized model and towards the minimum in the minimized model. It is also worth noting that the multi-modal appearance of the graph is due to the discrete nature of the predicted value.

The distributions of all three memory-augmented neural networks, before biasing, are nearly identical. The quartiles are identical, with negligible differences in mean. Similar results are observed with the baseline models as well, with the exception of the Vanilla RNN, which struggled to generate valid strings. These are the same models used in the unbiased logP experiments (no biasing done). The training dataset is also very similar to that of the unbiased models.

The minimized set of generators have all shifted their distributions from the unbiased models towards a lower number of benzene rings. It is evident that the biasing was more effective for the NTM and the DNC than it was for the StackRNN: Q3 for the StackRNN was 1, while it was 0 for the NTM and DNC. The DNC and NTM have comparable distributions, but the DNC generated a higher percentage of valid strings (93.15% vs 89.26%). The NTM produced molecules with lower synthetic accessibility scores (NTM: 2.66 vs DNC: 3.17 vs StackRNN: 3.44), although all of them produced a large number of strings with scores below the heuristic value of 6 (the approximate cutoff for molecules that are difficult to synthesize).
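The benzene-ring counts discussed in this comparison can be computed with RDKit; one reasonable implementation, assuming an aromatic six-membered carbocycle SMARTS (which may differ from the predictor's exact definition), is:

```python
from rdkit import Chem

BENZENE = Chem.MolFromSmarts("c1ccccc1")

def benzene_ring_count(smi):
    """Number of aromatic six-membered carbocycles matched in the molecule."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    return len(mol.GetSubstructMatches(BENZENE, uniquify=True))
```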

Minimization
The baseline models (particularly the GRU and the Vanilla RNN) were unable to bias their respective distributions towards minimizing benzene rings effectively. This is evident in their nearly unchanged distributions in comparison to the unbiased models. The Vanilla RNN was also observed to overfit, with only 42 unique strings out of 4000 generated strings. The LSTM was able to achieve some degree of biasing, but the memory-augmented RNNs (specifically the NTM and the DNC) outperformed the LSTM.

The maximized set of generators have all shifted their distributions from the unbiased models towards a higher number of benzene rings. There is a discernible difference between the three generated distributions, with the NTM outperforming the StackRNN and the DNC outperforming both. The extent of biasing is the greatest for the DNC by a significant margin, and it is the only model for which 25% of the generated strings have more than 4 benzene rings. The DNC and NTM also significantly outperform the StackRNN in terms of generating valid strings, while maintaining an insignificant overlap with the training set. The medians of the synthetic accessibility scores are also well within the permissible limit of 6.

Maximization Results
Similar to the case of minimizing benzene rings, the baseline models were unable to bias their respective distributions towards maximizing benzene rings effectively. While the Vanilla RNN seemed to bias more effectively than the GRU and LSTM models, it could only generate 218 unique strings out of 4000 generated. This suggests that the model overfit and hence lacks novelty, as captured by the common string percentage. The GRU and LSTM models biased far less effectively than the memory-augmented models.

Predictor Results
The character-level CNN achieved an MSE of 0.039 for predicting the number of benzene rings from the SMILES representation of the molecule.
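For illustration, a minimal character-level CNN regressor of this kind could look as follows in PyTorch; the layer sizes and names are illustrative, not the architecture reported in the main paper:

```python
import torch.nn as nn

class CharCNNPredictor(nn.Module):
    """Minimal character-level CNN that maps a tokenized SMILES string to a
    scalar property (e.g. the number of benzene rings). Illustrative sketch."""
    def __init__(self, vocab_size, embed_dim=32, num_filters=64, kernel_size=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=kernel_size // 2)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # pool over the sequence dimension
            nn.Flatten(),
            nn.Linear(num_filters, 1),
        )

    def forward(self, token_ids):                   # token_ids: (batch, seq_len) character indices
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len) for Conv1d
        return self.head(self.conv(x)).squeeze(-1)  # one property value per string

# Training would minimize nn.MSELoss() between predictions and RDKit-derived targets.
```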

Common string percentage over entire data
It should be noted that the training dataset used was only half of what was available. When the common string percentage was calculated against the fully available dataset, the common string percentages increased by around 0.016, 0.04, and 0.02 for the stack RNN, NTM, and DNC respectively. An increase is to be expected, but it is fairly small. For the biased models, the common string percentage is exactly the same whether half of the dataset or the full dataset is used. The same is true for logP as well.

Average length of the SMILES strings generated
It can be seen in the tables below that biasing has an effect on the length of the strings generated. The strings are shorter when minimizing the value of logP or the number of benzene rings in the molecule, whereas biasing towards a higher value of logP or a higher number of benzene rings causes the model to generate longer strings.

Additional Metrics
The metrics are presented for all of our models: the 3 baseline models, the 3 unbiased memory-augmented networks, and the 12 biased memory-augmented networks (maximized/minimized logP/benzene).

MOSES Metrics
The following tables have metrics from MOSES.
30,000 strings were generated for each model, as recommended by MOSES. The training set was chosen such that the average string length was around 37. The test/reference set in this case has an average string length of 48. We did not have a scaffold test set, as that was not the focus of the project. No filtering was done on our dataset, hence the filter values are not reported either.
It should be noted that, during training, the only metric optimized for the unbiased models was validity, while the biased models were additionally optimized towards their target property.
The following metrics are reported:
1. Frag/Test - Measures similarity of fragments between the test and generated sets.
2. IntDiv/IntDiv2 - Measures diversity of the generated set.
3. Novelty - Measures the proportion of generated molecules that are not present in the training set.
4. SNN - Measures similarity between the fingerprint of a molecule in the generated set and that of its nearest neighbour in the test set.
5. Scaf/Test - Measures similarity of Bemis-Murcko scaffolds between the test and generated sets.
6. Unique - Measures uniqueness of the generated molecules.
7. Valid - Measures the percentage of valid molecules, i.e. those with proper valence, closed brackets, etc.
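These metrics can be computed with the MOSES package roughly as follows (a sketch assuming the molsets package, with generated_smiles, test_smiles, and train_smiles standing in for lists of SMILES strings):

```python
import moses

metrics = moses.get_all_metrics(
    gen=generated_smiles,  # 30,000 generated strings per model
    test=test_smiles,      # reference set (average string length ~48 here)
    train=train_smiles,    # training set, used e.g. for the Novelty metric
)
print(metrics)  # dictionary with keys such as 'valid', 'Novelty', 'IntDiv', 'SNN/Test', ...
```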
Wasserstein-1 distances are also reported for molecular properties such as lipophilicity (logP), synthetic accessibility (SA), quantitative estimation of drug-likeness (QED), and molecular weight.

From the above tables for the biased models, we can establish the following trends. The fragment similarity, SNN, and scaffold similarity are lower for the biased models. This means that the biased models produce molecules which are farther away from the dataset, which was our intention. On the other hand, metrics like internal diversity, novelty, uniqueness, and validity stay fairly similar, or even improve, for the biased models. We can conclude that these metrics are not compromised by biasing.
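The Wasserstein-1 distances for individual properties can be reproduced along these lines (a sketch assuming SciPy and RDKit; the SA score requires RDKit's Contrib sascorer and is omitted here):

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, QED
from scipy.stats import wasserstein_distance

def property_w1(generated_smiles, reference_smiles, prop=Crippen.MolLogP):
    """Wasserstein-1 distance between a property's distributions in the generated
    and reference sets; QED.qed or Descriptors.MolWt can be passed as `prop`."""
    def values(smiles):
        mols = (Chem.MolFromSmiles(s) for s in smiles)
        return [prop(m) for m in mols if m is not None]
    return wasserstein_distance(values(generated_smiles), values(reference_smiles))
```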

Unbiased models
The following tables have the metrics for the Wasserstein-1 distance of the molecular properties. The metrics are generally on the higher side (see Section 5). The QED generally decreases for the biased models.
The SA and logP distances increase for the biased models. A possible explanation for the SA increase is that the biased molecules are inherently harder to synthesize. The logP increase is expected, as the molecules would inevitably be farther away from the dataset, whether minimized or maximized. The weight distance is fairly random in nature.
As some of the values were surprisingly low, we ran a control in which the generated set was taken from the test set, while the test set was a random sample of 100k molecules from the training set. It can be seen that these values do not differ much from those of the molecules generated by our models.

Fréchet ChemNet Distance
We have used the Fréchet ChemNet Distance (FCD) as a quantitative measure of the similarity between the generated molecules and the dataset used to train the generator.

Table 49 - FCD for unbiased baseline models and biased memory-augmented models

As seen from the results above, the unbiased StackRNN, NTM, and DNC models have a very low FCD to the training dataset used to train the generators. This indicates that these models, when optimized only towards validity of the SMILES format, are capable of learning and producing diverse molecules which possess chemical and biological properties similar to already known molecules (the training dataset).
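The FCD values in Table 49 can be computed with the reference FCD package along the following lines (a sketch; generated_smiles and training_smiles are placeholder lists of SMILES strings):

```python
from fcd import canonical_smiles, get_fcd, load_ref_model

model = load_ref_model()  # pretrained ChemNet used by the FCD
gen = [s for s in canonical_smiles(generated_smiles) if s is not None]
ref = [s for s in canonical_smiles(training_smiles) if s is not None]
fcd_value = get_fcd(gen, ref, model)
```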
When examining the FCD values for all the biased models, they are much higher than those of the unbiased models, as expected. The models are biased towards producing molecules with a certain property, and hence are expected to have a larger dissimilarity with the overall training dataset.

However, for 128 units, the increase in stack memory seems to have no effect, with all the models performing similarly. Together with our results for the higher-memory models, where the LSTM performed comparably to the stack RNN, we can conclude that, for our dataset, beyond a certain number of LSTM units (128 in our case) the benefit of augmented memory is negligible.

2-D representation of molecule generated
Below are the 2-D structures of a sample of the generated molecules that were optimized towards minimizing the value of logP. The synthetic accessibility scores of the molecules are also calculated.
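The depictions and SA scores below can be reproduced with RDKit roughly as follows (a sketch; the sascorer module ships in RDKit's Contrib directory):

```python
import os
import sys

from rdkit import Chem, RDConfig
from rdkit.Chem import Draw

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # RDKit Contrib implementation of the synthetic accessibility score

def draw_with_sa(smiles_sample, per_row=4):
    """Draw a grid of molecules with their synthetic accessibility scores as legends."""
    mols = [m for m in (Chem.MolFromSmiles(s) for s in smiles_sample) if m is not None]
    legends = ["SA = {:.2f}".format(sascorer.calculateScore(m)) for m in mols]
    return Draw.MolsToGridImage(mols, molsPerRow=per_row, legends=legends)
```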